Section 1 : Introduction

1.1 About Capital Bikeshare

Capital Bikeshare is a bike-sharing system serving the Washington, D.C. area that gives residents and visitors a flexible and convenient way to get around. Launched in 2010, it has grown steadily in popularity, with over 5 million trips taken in 2019 alone. By offering an affordable and accessible alternative to driving, the system aims to reduce traffic congestion, improve air quality, and promote healthy lifestyles. Capital Bikeshare operates on a cost-recovery model: user fees and sponsorships cover the cost of operations and maintenance, so it functions as a public service that aims to provide sustainable and equitable transportation for the community. The introduction of electric-assist bicycles has made the system more attractive still, making it easier to travel longer distances and tackle hilly terrain. Its success has inspired similar bike-sharing systems in other cities, supporting the shift towards sustainable, environmentally friendly transportation.

The purpose of this project is to analyze the impact of Capital Bikeshare on transportation, public health, and community access. Using data analysis, we aim to identify ways to improve the system and to promote its benefits to a wider audience. Specifically, the goal is to better understand the role of Capital Bikeshare in creating a sustainable and accessible transportation system, while identifying usage patterns and trends that contribute to reducing traffic congestion, improving air quality, and promoting healthy lifestyles.

1.2 SMART Questions

Overall, we want to know which factors affect the demand for Capital Bikeshare. To explore this, we ask the following SMART questions:

  1. What is the impact of seasonality on bike rental demand?

  2. What are the key factors that influence bike rental demand, and how do they affect the overall bike-sharing system performance?

  3. What are the most popular bike stations and routes, and how do they vary by time of day, season, and day of the week?

  4. How does the weather (temperature, humidity, wind speed) impact bike usage patterns?

  5. How does bike usage vary during holidays compared to regular days?

Section 2 : Description of Data

2.1 Source of data

Three distinct sources were used to gather the data: Capital Bikeshare (CaBi) trip records, weather data from Visual Crossing, and US holiday dates from Time and Date.

2.1.1 The data from Capital Bikeshare (CaBi) contains the variables:

Capital Bikeshare

Variable               Description
Duration               The length of time (in seconds) that the bike was rented for.
Start date             The date and time that the bike rental started.
End date               The date and time that the bike rental ended.
Start station number   The identification number of the bike-sharing station where the rental started.
Start station          The name of the bike-sharing station where the rental started.
End station number     The identification number of the bike-sharing station where the rental ended.
End station            The name of the bike-sharing station where the rental ended.
Bike number            The identification number of the bike that was rented.
Member type            Indicates whether the person who rented the bike was a member or a casual user.

2.1.2 The data from Weather contains the variables:

Weather

Variable        Description
temperature     The temperature in degrees Fahrenheit.
feelsliketemp   The “feels like” temperature in degrees Fahrenheit, which takes into account factors such as humidity and wind.
dew             The dew point in degrees Fahrenheit.
humidity        The relative humidity as a percentage.
windspeed       The wind speed in miles per hour.
uvindex         The UV index, which is a measure of the strength of ultraviolet radiation from the sun.
weather         A categorical variable describing the weather conditions (sunny, cloudy, rainy, etc.).

2.1.3 The data from US Holidays contains the variables:

US Holidays

Variable   Description
Date       The day and month of the holiday.
holiday    The name of the holiday.

The dataset used for this analysis was collected from three sources: Capital Bikeshare (CaBi), weather (Visual Crossing), and US holidays (Time and Date). The CaBi data spans the period from September 2010 to March 2023, so we filtered the weather and US holidays data to the same period to ensure consistency. Because the CaBi data pertains only to Washington, D.C., we filtered the weather data to that location. The holiday data was extracted from a website and transferred to an Excel sheet, and both R and Excel commands were used to remove unnecessary columns.

Once all three data sources were filtered and cleaned, we integrated them into a single CSV file, which was used for all further analysis and modeling.

2.2 Preprocessing of Data

  1. Sources of data : We collected data from three sources: Capital Bikeshare (CaBi), weather (Visual Crossing), and US holidays (Time and Date).

  2. Time frame : The CaBi data covers the period from September 2010 to March 2023. Therefore, we filtered the weather and US holidays data to the same period to ensure consistency.

  3. Filtering weather data : As the CaBi data pertains only to Washington, D.C., we filtered the weather data to this location so that it is relevant to our analysis.

  4. Extracting holiday data : We extracted the holiday data table from a website and transferred it to an Excel sheet. We then used both R and Excel commands to remove any unnecessary columns from the data.

  5. Merging data : Once we had filtered and cleaned all three data sources, we integrated them into a single CSV file. This file was then used for further analysis and modeling.

Overall, this process ensured that we had a reliable and relevant dataset to work with, enabling us to gain insights into various aspects of bike sharing, weather patterns, and holiday trends.
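
The merge step can be illustrated with a minimal sketch. The report performed this work in R and Excel; the Python below only shows the idea of joining the three sources on the calendar date. The function name merge_sources, the station names, and all sample values are hypothetical.

```python
from datetime import date


def merge_sources(trips, weather_by_date, holidays_by_date):
    """Attach weather and holiday fields to each trip record by calendar date."""
    merged = []
    for trip in trips:
        day = trip["started_at"]
        weather = weather_by_date.get(day)
        if weather is None:
            continue  # no weather record for this day, so the trip is skipped
        row = dict(trip)
        row.update(weather)
        # Holiday name if the day is a holiday, otherwise None.
        row["holiday"] = holidays_by_date.get(day)
        merged.append(row)
    return merged


# Hypothetical sample records, not taken from the CaBi data.
trips = [
    {"started_at": date(2019, 7, 4), "start_station_name": "Station A", "duration": 845},
    {"started_at": date(2019, 7, 5), "start_station_name": "Station B", "duration": 625},
]
weather_by_date = {
    date(2019, 7, 4): {"temperature": 84.2, "humidity": 64.0},
    date(2019, 7, 5): {"temperature": 86.0, "humidity": 70.1},
}
holidays_by_date = {date(2019, 7, 4): "Independence Day"}

merged = merge_sources(trips, weather_by_date, holidays_by_date)
```

Keeping the holiday field empty for ordinary days mirrors the NULL values that the cleaning step later recodes.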

Section 3 : Data Cleaning

  • After preprocessing, the data has 1609103 rows and 18 variables.

The final selected variables are :

Variable             Description
started_at           The date and time when the bike rental started.
start_station_name   The name of the bike station where the rental started.
member_type          The type of member who rented the bike (casual or member).
duration             The duration of the bike rental in seconds.
noofbikes            The number of bikes rented from this station on this date.
temperature          The temperature in degrees Fahrenheit.
feelsliketemp        The “feels like” temperature in degrees Fahrenheit, which takes into account factors such as humidity and wind.
dew                  The dew point in degrees Fahrenheit.
humidity             The relative humidity as a percentage.
windspeed            The wind speed in miles per hour.
uvindex              The UV index, which is a measure of the strength of ultraviolet radiation from the sun.
weather              A categorical variable describing the weather conditions (sunny, cloudy, rainy, etc.).
weekday              A categorical variable indicating whether the rental occurred on a weekday or a weekend.
holiday              A variable indicating whether or not the rental occurred on a holiday.
season               A categorical variable indicating the season (spring, summer, fall, or winter).
date                 The day of the month of the bike rental.
month                The month of the bike rental.
year                 The year of the bike rental.
  • There are 603620 NULL values and 0 duplicates in the dataframe.

  • The variables and their data types before cleaning and formatting:

    Variable             Type
    started_at           character
    start_station_name   character
    member_type          character
    duration             numeric
    noofbikes            integer
    temperature          numeric
    feelsliketemp        numeric
    dew                  numeric
    humidity             numeric
    windspeed            numeric
    uvindex              integer
    weather              character
    weekday              character
    holiday              character
    season               character
    date                 integer
    month                integer
    year                 integer

  • The summary of the data set before cleaning and formatting:

    Variable        Min.       1st Qu.   Median   Mean    3rd Qu.   Max.
    duration        -1737539   625       845      1399    1278      7592116
    noofbikes       1.00       3.00      11.00    21.88   30.00     1163.00
    temperature     -9.70      26.60     50.60    48.46   71.70     92.90
    feelsliketemp   -16.40     25.50     49.20    47.43   71.60     103.20
    dew             -19.20     17.20     37.10    36.55   59.10     76.90
    humidity        19.00      53.50     64.00    63.56   73.90     98.10
    windspeed       4.30       12.30     15.80    17.37   21.20     58.50
    uvindex         0.000      4.000     6.000    5.805   8.000     10.000
    date            1.00       8.00      16.00    15.72   23.00     31.00
    month           1.00       4.00      7.00     6.58    10.00     12.00
    year            2010       2015      2018     2018    2021      2023

    The remaining columns (started_at, start_station_name, member_type, weather, weekday, holiday, season) are of class character, each with length 1609103.

  • The started_at column is parsed as a date in year-month-day format (Y-m-d).

  • Blank values in start_station_name are replaced with NA.

  • The member_type column is cleaned by converting the inconsistently capitalized casual and member values to lowercase. Rows with the value “Unknown” in member_type are counted and then removed from the data frame, since they cannot be classified. Finally, member_type is converted to a factor for efficient storage and analysis.

  • The duration column is converted to numeric and rounded to two decimal places, and rows with negative values are removed.

  • The weather categories in the CaBi data frame are standardized by grouping similar weather conditions together, and the column is converted to a factor for efficient storage and analysis.

Old Value                      New Value
Partially cloudy               Cloudy
Rain, Overcast                 OvercastRain
Rain, Partially cloudy         Rain
Snow, Rain, Overcast           Overcast
Snow, Rain, Partially cloudy   Cloudy
Snow, Partially cloudy         Snow
Snow, Overcast                 OvercastSnow
Snow, Rain                     Rain
  • Converting “weekday” column to a factor for efficient storage and analysis.

  • The holiday variable contains the names of holidays; these are replaced with the value “holiday”, and NULL values are replaced with “not holiday”.

  • Converting season column to a factor for efficient storage and analysis.

  • Removing the date column from the “CaBi” data frame as it is no longer needed for the analysis.

  • Converting month column to a factor and changing factor levels to month names for easier interpretation.

  • Converting year column to numeric for efficient storage and analysis.

  • Printing a summary of the “CaBi” data frame after cleaning and formatting:

    Numeric variables:

    Variable        Min.     1st Qu.   Median   Mean    3rd Qu.   Max.
    duration        1        625       845      1473    1278      7592116
    noofbikes       1.00     3.00      11.00    21.88   30.00     1163.00
    temperature     -9.70    26.60     50.60    48.46   71.70     92.90
    feelsliketemp   -16.40   25.50     49.20    47.43   71.60     103.20
    dew             -19.20   17.20     37.10    36.55   59.10     76.90
    humidity        19.00    53.50     64.00    63.56   73.90     98.10
    windspeed       4.30     12.30     15.80    17.37   21.20     58.50
    uvindex         0.000    4.000     6.000    5.806   8.000     10.000
    year            2010     2015      2018     2018    2021      2023

    started_at: Min. 2010-09-20 00:00:00.00, 1st Qu. 2015-12-25 00:00:00.00, Median 2018-10-15 00:00:00.00, Mean 2018-05-28 16:01:43.94, 3rd Qu. 2021-03-06 00:00:00.00, Max. 2023-03-31 00:00:00.00

    start_station_name: class character, length 1608808

    Factor variables:

    Variable      Level          Count
    member_type   casual         344358
                  member         1264450
    weather       Clear          70997
                  Cloudy         864081
                  Overcast       65948
                  OvercastRain   168746
                  OvercastSnow   293
                  Rain           432429
                  Snow           6314
    weekday       Weekday        1158685
                  Weekend        450123
    holiday       holiday        1005209
                  not holiday    603599
    season        Fall           415069
                  Spring         400827
                  Summer         415276
                  Winter         377636
    month         October        143030
                  August         141278
                  March          140401
                  July           139873
                  September      138118
                  May            134607
                  (Other)        771501
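
The cleaning rules listed in this section can be sketched in code. The report did this in R; the Python below is only an illustration of the same rules on a list of dictionaries. The function name clean_rows and the sample rows are hypothetical, while the weather mapping copies the old and new values given in this section.

```python
# Old-to-new weather categories, as listed in the mapping table of this section.
WEATHER_MAP = {
    "Partially cloudy": "Cloudy",
    "Rain, Overcast": "OvercastRain",
    "Rain, Partially cloudy": "Rain",
    "Snow, Rain, Overcast": "Overcast",
    "Snow, Rain, Partially cloudy": "Cloudy",
    "Snow, Partially cloudy": "Snow",
    "Snow, Overcast": "OvercastSnow",
    "Snow, Rain": "Rain",
}


def clean_rows(rows):
    """Apply the member_type, duration, station, weather and holiday rules."""
    cleaned = []
    for row in rows:
        member = row["member_type"].strip().lower()
        if member == "unknown":
            continue  # cannot be classified as casual or member
        if row["duration"] < 0:
            continue  # negative durations are invalid
        out = dict(row)
        out["member_type"] = member
        out["duration"] = round(float(row["duration"]), 2)
        # Blank station names become None (the analogue of NA).
        out["start_station_name"] = row["start_station_name"].strip() or None
        # Categories without a mapping entry are kept unchanged.
        out["weather"] = WEATHER_MAP.get(row["weather"], row["weather"])
        out["holiday"] = "holiday" if row["holiday"] else "not holiday"
        cleaned.append(out)
    return cleaned


# Hypothetical sample rows, not taken from the CaBi data.
rows = [
    {"member_type": "Member", "duration": 845.457, "start_station_name": "Station A",
     "weather": "Rain, Overcast", "holiday": None},
    {"member_type": "Unknown", "duration": 300, "start_station_name": "Station B",
     "weather": "Clear", "holiday": None},
    {"member_type": "casual", "duration": -5, "start_station_name": "Station C",
     "weather": "Clear", "holiday": None},
    {"member_type": "Casual", "duration": 625, "start_station_name": "  ",
     "weather": "Clear", "holiday": "Labor Day"},
]
cleaned = clean_rows(rows)
```

Of the four sample rows, the “Unknown” member and the negative duration are dropped, which is the same kind of filtering that reduces the data from 1609103 to 1608808 rows.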

Section 4 : Exploratory Data Analysis

4.1 Bike Rental Demand by Season

The output plot shows that the highest percentage of bike rentals occurred during the summer season, followed by fall, spring, and winter. The chart provides a clear visualization of the differences in bike rental demand across different seasons.

4.2 Bike Rental Demand by Season and Temperature

The scatter plot below shows that bike rental demand is influenced by both temperature and season. Demand is highest in the spring and summer, with a peak in May and June, and lowest in the winter.

4.3 Bike Demand by Weather Conditions

The plot below shows that bike rental demand is highest on cloudy and clear days, followed by rainy days, and lowest on days with overcast snow.

4.4 Distribution of Bikes by Weather Condition

The scatter plot below displays the distribution of bike rentals by weather condition, with each point representing the number of bikes rented during a specific weather condition. The highest demand for bike rentals occurs on cloudy and clear days, followed by rainy and overcast days.

4.5 Correlation Matrix of the Variables in the CaBi Dataset

The resulting heatmap shows very strong positive correlations among the temperature-related variables: temperature and feelsliketemp (1.00), temperature and dew (0.97), and feelsliketemp and dew (0.98). Correlations between noofbikes and the weather variables are weak, ranging from -0.10 (windspeed) to 0.19 (temperature, feelsliketemp, and uvindex). There is also a negative correlation between humidity and windspeed (-0.25).

                noofbikes   temperature   feelsliketemp   dew     humidity   windspeed   uvindex
noofbikes       1.00        0.19          0.19            0.17    -0.02      -0.10       0.19
temperature     0.19        1.00          1.00            0.97    0.21       -0.53       0.38
feelsliketemp   0.19        1.00          1.00            0.98    0.23       -0.52       0.39
dew             0.17        0.97          0.98            1.00    0.43       -0.54       0.27
humidity        -0.02       0.21          0.23            0.43    1.00       -0.25       -0.39
windspeed       -0.10       -0.53         -0.52           -0.54   -0.25      1.00        -0.07
uvindex         0.19        0.38          0.39            0.27    -0.39      -0.07       1.00
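
For readers who want to see how each cell of such a matrix is obtained, here is a minimal sketch of the Pearson correlation coefficient, assuming that is the coefficient behind the matrix (the report does not name the method). The functions pearson and correlation_matrix and the toy columns are hypothetical and are not the CaBi data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Pairwise correlations for a dict of column name -> values, rounded to 2 places."""
    names = list(columns)
    return {a: {b: round(pearson(columns[a], columns[b]), 2) for b in names} for a in names}


# Toy columns (hypothetical values).
toy = {
    "temperature": [30.0, 50.0, 70.0, 90.0],
    "noofbikes": [4.0, 9.0, 15.0, 18.0],
    "windspeed": [20.0, 18.0, 12.0, 10.0],
}
matrix = correlation_matrix(toy)
```

Every diagonal entry is 1.00 by construction, and the matrix is symmetric, which is why the table above mirrors itself across the diagonal.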

4.6 Bike Demand by Year and Weekday

The plot shows the total number of bikes rented in each year grouped by the weekday. The plot suggests that there is a higher demand for bikes on weekdays compared to weekends, and the demand for bikes has been increasing over the years.

4.7 Percentage of total bikes rented

The total number of bikes rented is aggregated by holiday status, and a bar plot shows the percentage of total bikes rented for each category (holiday or not holiday). The plot also shows the number of bikes rented in each category.

4.8 Bike Rentals by Month

Except for the colder months (January, February, November, and December), all months have a comparatively high average number of rentals.

The total number of bikes rented is summarized by month, and a pie chart shows each month's proportion of the total, with a different color for each month.

4.9 Maximum and Minimum Bike Demand by Year and Station

The plot is helpful in identifying the bike stations with the highest and lowest demand for bikes over the years, and it can be useful for bike-sharing companies to make strategic decisions. Companies can focus more on bike stations with high demand, while reducing resources allocated to bike stations with low demand.

4.10 Bike Demand by Year and Holiday Type

The plot shows that the bike demand for holidays is higher than that of non-holidays for most of the years. However, in 2020 there was a sharp drop in bike demand for both holidays and non-holidays, which could be attributed to the COVID-19 pandemic. Finally, 2023 includes only three months of data, so its noofbikes count is low.

Section 5 : Categorical Variables Test

5.0.1 Data Shortening

Instead of sampling from the large dataset, we aggregated noofbikes by started_at alone; previously it was aggregated by both started_at and start_station_name. We also dropped the variables that are not required for modeling (started_at, start_station_name, member_type, duration, month, year).

  • The total number of rows changes from 1608808 to 4572.

  • Summary of the data used in the rest of the analysis:

    Numeric variables:

    Variable        Min.     1st Qu.   Median   Mean    3rd Qu.   Max.
    noofbikes       21       4642      7533     7699    10847     19531
    temperature     -9.70    35.70     53.40    51.23   72.40     92.90
    feelsliketemp   -16.40   30.30     52.40    49.96   72.40     103.20
    dew             -19.20   20.77     40.25    38.75   59.52     76.90
    humidity        19.00    53.00     63.70    63.41   73.90     98.10
    windspeed       4.30     12.20     15.30    16.68   20.12     58.50
    uvindex         0.000    4.000     6.000    5.895   8.000     10.000

    Factor variables:

    Variable   Level          Count
    weather    Clear          200
               Cloudy         2379
               Overcast       224
               OvercastRain   533
               OvercastSnow   3
               Rain           1210
               Snow           23
    weekday    Weekday        3268
               Weekend        1304
    holiday    holiday        2861
               not holiday    1711
    season     Fall           1164
               Spring         1135
               Summer         1104
               Winter         1169
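
The aggregation described in 5.0.1 amounts to summing the per-station counts for each date. The following is a minimal sketch of that grouping, not the R code used for the report; the function name daily_totals and the sample rows are hypothetical.

```python
from collections import defaultdict


def daily_totals(station_rows):
    """Sum noofbikes over all stations for each started_at date."""
    totals = defaultdict(int)
    for row in station_rows:
        totals[row["started_at"]] += row["noofbikes"]
    return dict(totals)


# Hypothetical per-station counts for two days.
station_rows = [
    {"started_at": "2019-07-04", "start_station_name": "Station A", "noofbikes": 30},
    {"started_at": "2019-07-04", "start_station_name": "Station B", "noofbikes": 11},
    {"started_at": "2019-07-05", "start_station_name": "Station A", "noofbikes": 3},
]
totals = daily_totals(station_rows)
```

Collapsing to one row per date is what shrinks the data to 4572 rows, roughly one row per day of the study period.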

5.1 ANOVA for all Categorical Variables

5.1.1 ANOVA for Bike Rentals by Season

            Df     Sum Sq        Mean Sq      F value    Pr(>F)
season      3      17723779732   5907926577   573.7721   0
Residuals   4568   47035062895   10296642

The p-value is less than 0.05 (p < 0.05), which indicates that there is a significant difference in the mean number of bikes rented across different seasons. Therefore, we can reject the null hypothesis that there is no significant difference in the mean number of bikes rented across different seasons. The F-value of 573.8 is also quite large, which further supports the conclusion that the mean number of bikes rented across different seasons is significantly different.
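
As a worked check of how the F value arises, the sketch below computes a one-way ANOVA F statistic from raw groups and then rebuilds the season F value from the Sum Sq and Df columns reported above. The function one_way_anova and the toy groups are hypothetical; only the four table numbers come from this report.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    # Between-group sum of squares: group size times squared distance of the
    # group mean from the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: squared distance of each value from its
    # own group mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Toy groups (hypothetical numbers, not taken from the CaBi data).
f_toy, df_b, df_w = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])

# The same ratio rebuilt from the season ANOVA table: (Sum Sq / Df) for season
# divided by (Sum Sq / Df) for the residuals.
f_season = (17723779732 / 3) / (47035062895 / 4568)
```

The rebuilt ratio agrees with the reported F value of 573.7721, which confirms that F is simply the between-season mean square divided by the residual mean square.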

5.1.2 ANOVA for Bike Rentals by Holiday

            Df     Sum Sq        Mean Sq    F value     Pr(>F)
holiday     1      8900068       8900068    0.6281598   0.4280723
Residuals   4570   64749942560   14168478

The p-value for holiday is 0.428, which is greater than 0.05, the typical significance level used in statistical analysis. This means that we fail to reject the null hypothesis that there is no significant difference in the mean number of bikes rented on holidays and non-holidays.

5.1.3 ANOVA for Bike Rentals by weekday

            Df     Sum Sq        Mean Sq    F value     Pr(>F)
weekday     1      2586542       2586542    0.1825383   0.669221
Residuals   4570   64756256086   14169859

The weekday variable has a p-value of 0.669, which is also greater than 0.05. This suggests that there is not enough evidence to reject the null hypothesis that the mean number of bikes rented is the same on weekdays and weekends.

5.1.4 ANOVA for Bike Rentals by weather

            Df     Sum Sq        Mean Sq     F value    Pr(>F)
weather     6      5372980354    895496726   68.83697   0
Residuals   4565   59385862274   13008951

The F-value is 68.84, and the p-value is less than 2e-16, which is much smaller than the significance level of 0.05. This suggests that we can reject the null hypothesis and conclude that there is a significant difference in the mean noofbikes across different levels of weather.

Section 6 : Modeling and Prediction

6.1 Linear Regression for Numeric Variables Individually

6.1.1 temperature vs noofbikes

## 
## Call:
## lm(formula = noofbikes ~ temperature, data = CaBi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9036.6 -2774.4   -23.8  2674.9 13176.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4958.313    126.853   39.09   <2e-16 ***
## temperature   53.497      2.254   23.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3552 on 4570 degrees of freedom
## Multiple R-squared:  0.1098, Adjusted R-squared:  0.1096 
## F-statistic: 563.5 on 1 and 4570 DF,  p-value: < 2.2e-16

6.1.2 feelsliketemp vs noofbikes

## 
## Call:
## lm(formula = noofbikes ~ feelsliketemp, data = CaBi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9214.6 -2687.0    -7.8  2564.5 13181.5 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4919.579    115.113   42.74   <2e-16 ***
## feelsliketemp   55.640      2.059   27.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3495 on 4570 degrees of freedom
## Multiple R-squared:  0.1378, Adjusted R-squared:  0.1376 
## F-statistic: 730.4 on 1 and 4570 DF,  p-value: < 2.2e-16

6.1.3 Dew vs noofbikes

## 
## Call:
## lm(formula = noofbikes ~ dew, data = CaBi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9115.8 -2772.0   -18.3  2758.2 13144.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5700.21     103.63   55.01   <2e-16 ***
## dew            51.58       2.30   22.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3573 on 4570 degrees of freedom
## Multiple R-squared:  0.09912,    Adjusted R-squared:  0.09893 
## F-statistic: 502.8 on 1 and 4570 DF,  p-value: < 2.2e-16

6.1.4 humidity vs noofbikes

## 
## Call:
## lm(formula = noofbikes ~ humidity, data = CaBi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7650.8 -3042.3  -199.1  3147.9 11663.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8378.319    254.416  32.932  < 2e-16 ***
## humidity     -10.710      3.915  -2.736  0.00625 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3761 on 4570 degrees of freedom
## Multiple R-squared:  0.001635,   Adjusted R-squared:  0.001416 
## F-statistic: 7.483 on 1 and 4570 DF,  p-value: 0.006252

6.1.5 windspeed vs noofbikes

## 
## Call:
## lm(formula = noofbikes ~ windspeed, data = CaBi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7853.6 -3047.6  -145.1  3108.2 11724.3 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8654.757    149.343   57.95  < 2e-16 ***
## windspeed    -57.301      8.317   -6.89 6.35e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3745 on 4570 degrees of freedom
## Multiple R-squared:  0.01028,    Adjusted R-squared:  0.01006 
## F-statistic: 47.47 on 1 and 4570 DF,  p-value: 6.349e-12

6.1.6 uvindex vs noofbikes

## 
## Call:
## lm(formula = noofbikes ~ uvindex, data = CaBi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8962.9 -2494.0    93.2  2376.8 12489.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3566.54     123.18   28.95   <2e-16 ***
## uvindex       701.04      19.17   36.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3311 on 4570 degrees of freedom
## Multiple R-squared:  0.2263, Adjusted R-squared:  0.2261 
## F-statistic:  1337 on 1 and 4570 DF,  p-value: < 2.2e-16

6.1.7 Table of Linear Regression individually

Predictor Variable   Intercept   Estimate   Adjusted R-squared   p-value     F-statistic
temperature          4958.3      53.5       0.1096               < 2.2e-16   563.5
feelsliketemp        4919.6      55.6       0.1376               < 2.2e-16   730.4
dew                  5700.2      51.6       0.0989               < 2.2e-16   502.8
humidity             8378.3      -10.7      0.0014               0.00625     7.483
windspeed            8654.8      -57.3      0.0101               6.35e-12    47.47
uvindex              3566.5      701.0      0.2261               < 2.2e-16   1337
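
To make the single-predictor fits concrete, here is a minimal sketch of ordinary least squares with one predictor, followed by a prediction that uses the temperature coefficients from the lm output in 6.1.1 (intercept 4958.313, slope 53.497). The function fit_line, the toy data, and the 70 °F input are hypothetical choices for illustration.

```python
def fit_line(x, y):
    """Least-squares slope, intercept and R-squared for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return slope, intercept, 1 - ss_res / ss_tot


# Toy data lying exactly on y = 2x + 1 (hypothetical values).
slope, intercept, r_squared = fit_line([0, 1, 2, 3], [1, 3, 5, 7])

# Prediction from the temperature model reported in 6.1.1 for a 70 °F day.
predicted_bikes = 4958.313 + 53.497 * 70
```

Under that fitted line, each additional degree Fahrenheit is associated with about 53 more rentals, so a 70 °F day is predicted at roughly 8703 bikes. With an R-squared near 0.11, individual days scatter widely around that value.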

6.2 Splitting the Data

  • Splitting the data into 80% training and 20% testing

The training set CaBitrain contains 3660 observations and the test set CaBitest contains 912 observations.
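
A random 80/20 split can be sketched as follows. This is a plain shuffle-and-cut illustration, not the partitioning routine used for the report; the function name train_test_split and the seed are hypothetical. With 4572 rows, a plain cut gives 3658 and 914 rows, slightly different from the 3660 and 912 reported above, so the report's routine presumably rounds or stratifies differently.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle row indices with a fixed seed and cut them into train and test parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = round(len(rows) * train_fraction)
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


# Stand-in for the 4572 daily rows: here each row is just its index.
rows = list(range(4572))
train, test = train_test_split(rows)
```

Fixing the seed makes the split reproducible, which matters when the models in the following sections are compared on the same test set.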

6.3 Step Linear Regression Model for Numeric Variables

## Start:  AIC=58835.95
## noofbikes ~ temperature + dew + humidity + windspeed + uvindex
## 
##               Df  Sum of Sq        RSS   AIC
## <none>                      3.4955e+10 58836
## - windspeed    1   49588986 3.5005e+10 58839
## - humidity     1 2242583738 3.7198e+10 59062
## - temperature  1 3125373823 3.8081e+10 59147
## - dew          1 3359115712 3.8314e+10 59170
## - uvindex      1 4826876083 3.9782e+10 59307
## 
## Call:
## lm(formula = noofbikes ~ temperature + dew + humidity + windspeed + 
##     uvindex, data = CaBitrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9269.2 -2169.1    52.2  2220.3 10792.1 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 22088.690   1239.763  17.817   <2e-16 ***
## temperature  -518.551     28.689 -18.075   <2e-16 ***
## dew           597.028     31.861  18.739   <2e-16 ***
## humidity     -223.698     14.610 -15.311   <2e-16 ***
## windspeed     -20.735      9.107  -2.277   0.0229 *  
## uvindex       604.906     26.929  22.463   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3093 on 3654 degrees of freedom
## Multiple R-squared:  0.3202, Adjusted R-squared:  0.3193 
## F-statistic: 344.2 on 5 and 3654 DF,  p-value: < 2.2e-16

The stepwise procedure starts from the full model with an AIC of 58835.95. Removing any single predictor would raise the AIC (the smallest increase is to 58839, when dropping windspeed), so no variable is removed and the model with temperature, dew, humidity, windspeed, and uvindex is retained.

The R-squared value of the model is 0.3202, indicating that 32.02% of the variance in noofbikes can be explained by the predictor variables in the model. The adjusted R-squared value is 0.3193, which adjusts for the number of predictor variables in the model.

Overall, the results suggest that the combination of temperature, dew, humidity, windspeed, and uvindex can be used to predict the number of bikes rented in CaBitrain.
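
The AIC values in the step output can be reproduced approximately from the RSS column, assuming the convention that R's step uses for linear models: AIC = n log(RSS/n) + 2p, where p is the number of estimated coefficients. The function name step_aic is hypothetical. Here n = 3660 (3654 residual degrees of freedom plus 6 coefficients), and the RSS values are the rounded figures printed above, so the results match only to within rounding.

```python
from math import log


def step_aic(rss, n, n_coefficients):
    """AIC up to an additive constant: n * log(RSS / n) + 2 * p."""
    return n * log(rss / n) + 2 * n_coefficients


n = 3660  # training rows

# Full model: intercept plus five predictors, RSS 3.4955e+10.
aic_full = step_aic(3.4955e10, n, 6)

# Model without windspeed: intercept plus four predictors, RSS 3.5005e+10.
aic_without_windspeed = step_aic(3.5005e10, n, 5)
```

Dropping windspeed saves 2 AIC points for the removed coefficient but costs about 5 points through the larger RSS, so the full model keeps the lower AIC.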

6.4 Linear Regression Model for all Variables

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9441.0 -1930.9   137.3  2095.3 10588.0 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          15817.360   1266.647  12.488  < 2e-16 ***
## temperature           -307.299     29.228 -10.514  < 2e-16 ***
## dew                    339.211     32.946  10.296  < 2e-16 ***
## humidity              -105.848     15.250  -6.941 4.59e-12 ***
## windspeed              -18.276      9.137  -2.000  0.04554 *  
## uvindex                419.655     28.108  14.930  < 2e-16 ***
## weatherCloudy          265.716    243.519   1.091  0.27528    
## weatherOvercast      -1056.564    335.119  -3.153  0.00163 ** 
## weatherOvercastRain  -1685.304    311.890  -5.404 6.95e-08 ***
## weatherOvercastSnow  -2679.742   1702.874  -1.574  0.11565    
## weatherRain           -539.624    268.745  -2.008  0.04472 *  
## weatherSnow          -1219.139    770.361  -1.583  0.11361    
## weekdayWeekend         -56.367    106.984  -0.527  0.59832    
## `holidaynot holiday`   -54.526    100.847  -0.541  0.58876    
## seasonSpring          -623.145    146.181  -4.263 2.07e-05 ***
## seasonSummer           616.347    153.350   4.019 5.96e-05 ***
## seasonWinter         -2687.742    154.018 -17.451  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2913 on 3643 degrees of freedom
## Multiple R-squared:  0.3988, Adjusted R-squared:  0.3961 
## F-statistic:   151 on 16 and 3643 DF,  p-value: < 2.2e-16

The first model (Section 6.3), which includes the five numeric predictors (temperature, dew, humidity, windspeed, and uvindex), has an adjusted R-squared value of 0.3193, indicating that these variables explain 31.93% of the variability in noofbikes. The F-statistic is significant (p-value < 2.2e-16), indicating that the predictors jointly explain significantly more variance than an intercept-only model.

The second model includes all of the predictors from the first model, as well as the categorical variables (weather, weekday, holiday, and season). This model has an adjusted R-squared value of 0.3961, indicating that these variables explain 39.61% of the variability in noofbikes. The F-statistic is also significant (p-value < 2.2e-16).

In both models, temperature, dew, humidity, windspeed, and uvindex are all significant predictors of noofbikes at the 0.05 level. Additionally, certain levels of the categorical variables are significant predictors (weatherOvercast, weatherOvercastRain, weatherRain, seasonSpring, seasonSummer, and seasonWinter). The remaining levels (weatherCloudy, weatherOvercastSnow, weatherSnow, weekdayWeekend, and holidaynot holiday) are not significant predictors of noofbikes.

6.5 Decision Tree

6.5.1 Tuning Regression Decision Tree

The resampling was performed using a 5-fold cross-validation with 3 repetitions, resulting in a total of 15 iterations. The summary of sample sizes indicates that each fold had approximately the same number of samples.

The results show that the optimal value for maxdepth is 8, which gives an RMSE of 2959.329, an R-squared of 0.3784168, and an MAE of 2397.165.
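
The resampling scheme can be sketched as below: 5 folds repeated 3 times gives 15 train and hold-out pairs. This is an illustration of repeated k-fold cross-validation in general, not the exact fold assignment used for the report; the function name repeated_kfold and the seed are hypothetical.

```python
import random


def repeated_kfold(n_rows, k=5, repeats=3, seed=1):
    """Return a list of (train_indices, held_out_indices) pairs."""
    rng = random.Random(seed)
    splits = []
    for _ in range(repeats):
        indices = list(range(n_rows))
        rng.shuffle(indices)  # a fresh shuffle for every repetition
        folds = [indices[i::k] for i in range(k)]
        for held_out in folds:
            held = set(held_out)
            train = [i for i in indices if i not in held]
            splits.append((train, held_out))
    return splits


# 3660 rows, the size of the training set CaBitrain.
splits = repeated_kfold(3660)
```

Each candidate maxdepth is then scored on all 15 hold-out folds and the scores are averaged, which is how a single RMSE per candidate value is obtained.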

6.5.2 Decision Tree plot

6.5.3 Importance of the variables from the decision tree

The most important feature is temperature with a value of 100, followed by dew with a value of 65.81, uvindex with a value of 38.52, and so on. The least important feature in this model appears to be windspeed with a value of 7.44.

6.5.4 Prediction on test data set CaBitest

R2 RMSE MAE
0.3665989 3044.444 2440.491

The tuned Decision Tree explains 36.66% of the variability in the target variable on the test data set. The root mean squared error (RMSE) is 3044.444 and the mean absolute error (MAE) is 2440.491.
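For reference, the three metrics reported for each model can be computed as in the Python sketch below. The sample values are made up for illustration, and R2 is computed here as one minus the ratio of residual to total sum of squares; some tools instead report the squared correlation between predicted and actual values, so this may not match the project's exact definition.

```python
import math


def regression_metrics(actual, predicted):
    """Return R2, RMSE and MAE for paired actual and predicted values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residual sum of squares
    sst = sum((a - mean_actual) ** 2 for a in actual)           # total sum of squares
    return {
        "R2": 1.0 - sse / sst,
        "RMSE": math.sqrt(sse / n),
        "MAE": sum(abs(a - p) for a, p in zip(actual, predicted)) / n,
    }


# Hypothetical daily rental counts and model predictions.
actual = [4200, 6100, 8800, 10300]
predicted = [5000, 6000, 8000, 9500]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

RMSE penalises large errors more heavily than MAE, which is why it is always at least as large as MAE for the same predictions.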

6.6 Bagged Decision Tree

The tuned Decision Tree yields a low R2, so we try a bagged Decision Tree to see whether it increases R2.
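The idea behind bagging can be sketched in Python as follows: fit many trees on bootstrap resamples of the training data and average their predictions. To keep the sketch short it uses one-split trees (stumps) on a single feature, which is a deliberate simplification and not the model fitted in this project; the data values and all function names are hypothetical.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree on one feature by minimising squared error."""
    best = None
    for threshold in sorted(set(xs))[1:]:            # both sides of the split stay non-empty
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:                                 # resample with a single distinct x value
        overall = mean(ys)
        return lambda x: overall
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def bagged_stumps(xs, ys, n_models=25, seed=0):
    """Average the predictions of stumps fitted on bootstrap resamples."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]   # sample rows with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: mean(model(x) for model in models)


# Hypothetical temperatures and daily rental counts.
temps = [2, 6, 11, 15, 19, 24, 28, 31]
rentals = [1500, 2400, 3900, 5200, 6800, 8100, 9300, 9900]
predict = bagged_stumps(temps, rentals)
print(predict(2), predict(15), predict(31))
```

Averaging over resamples smooths out the abrupt jumps of any single tree, which is the usual reason bagging lowers test error relative to one tuned tree.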

6.6.1 Importance of the variables from the Bagged decision tree

Temperature has the highest importance score of 100, followed by dew, UV index, humidity, and season. Other variables such as windspeed, weather, and weekday/weekend have relatively lower importance scores, indicating that they have a lesser impact on the target variable.

6.6.2 Prediction of Bagged decision tree on test data set CaBitest

R2 RMSE MAE
0.4031066 2957.301 2388.401

The R2 score is 0.4031066, which means that the predictor variables explain around 40% of the variance in the target variable. The RMSE is 2957.301, which means that the predicted values are typically about 2957 units away from the actual values.

6.7 Random Forest

6.7.1 Important Variables with the Random Forest Model

6.7.2 Random Forest Predictions on test data set CaBitest

R2 RMSE MAE
0.5086464 2682.561 2117.371

The R2 score is 0.5086464, which means that the predictor variables explain around 51% of the variance in the target variable. The RMSE is 2682.561, which means that the predicted values are typically about 2683 units away from the actual values.

Section 7 : Modeling with non-COVID years

7.1 Splitting the Data by removing COVID years

With the data randomly split into 80% training and 20% testing sets, the best of the models above reaches an R2 of only about 0.51. We therefore remove the COVID years from the data set.

The training data set covers the years 2010 to 2018 and the testing data set covers the years 2022 and 2023. The years 2019, 2020, and 2021 are excluded because they are treated as the COVID years.

  • After removing the COVID years, the training data set CaBitraincovid has 3476 observations and the test data set CaBitestcovid has 455 observations.
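The year-based split described above can be sketched in Python as follows. The record layout (a `year` field per daily record) and the generated sample rows are assumptions made for illustration; only the year ranges come from this report.

```python
TRAIN_YEARS = range(2010, 2019)   # 2010 to 2018 inclusive
TEST_YEARS = (2022, 2023)         # 2019 to 2021 are left out of both sets


def split_by_year(rows):
    """Split daily records into training and testing sets by calendar year."""
    train = [row for row in rows if row["year"] in TRAIN_YEARS]
    test = [row for row in rows if row["year"] in TEST_YEARS]
    return train, test


# Hypothetical records: one row per year, with a made-up rental count.
rows = [{"year": year, "noofbikes": 1000 + 100 * (year - 2010)}
        for year in range(2010, 2024)]
train_rows, test_rows = split_by_year(rows)
print(len(train_rows), len(test_rows))
```

Unlike the random 80%/20% split, this split tests the model on years that come strictly after the training period.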

7.2 Tuning Decision Tree

7.2.1 Tuning Regression Decision Tree for Max Tree Depth

The resampling was performed using a 5-fold cross-validation with 3 repetitions, resulting in a total of 15 iterations. The summary of sample sizes indicates that each fold had approximately the same number of samples.

The results show that the optimal value for maxdepth is 8, which gives an RMSE of 2959.329, an R-squared of 0.3784168, and an MAE of 2397.165.

7.2.2 Tuning Regression Decision Tree plot for prediction

7.2.3 Important Variables from the decision tree

The most important feature is temperature with a value of 100, followed by dew with a value of 65.81, uvindex with a value of 38.52, and so on. The least important feature in this model appears to be windspeed with a value of 7.44.

7.2.4 Prediction on test data set CaBitestcovid

R2 RMSE MAE
0.5849981 2442.709 1957.982

The tuned Decision Tree explains 58.5% of the variability in the target variable on the test data set. The root mean squared error (RMSE) is 2442.709 and the mean absolute error (MAE) is 1957.982.

7.3 Bagged Decision Tree

The R2 of the tuned Decision Tree is still low, so we try a bagged Decision Tree to see whether it increases R2.

7.3.1 Importance of the variables from the Bagged decision tree

We can see that temperature, dew, and seasonSummer are the three most important variables for predicting the outcome variable. The importance of the variables decreases as we move down the list. Variables with importance measures close to zero are unlikely to contribute much to the model’s predictive power.

7.3.2 Prediction of Bagged decision tree on test data set CaBitestcovid

R2 RMSE MAE
0.6583164 2206.154 1762.025

The R2 value has increased to 0.6583164, which indicates a better fit of the model to the data. Additionally, the RMSE value has decreased to 2206.154, and the MAE value has decreased to 1762.025. These values suggest that the bagged model predicts the number of bike rentals from the input features better than the tuned Decision Tree.

7.4 Random Forest

7.4.1 Important Variables with the Random Forest Model

7.4.2 Random Forest Predictions on test data set CaBitestcovid

R2 RMSE MAE
0.9633303 759.8413 581.2308

The Random Forest model has a higher R2 value (0.9633 versus 0.5850) and lower RMSE and MAE values than the tuned regression tree on the same test data set. This suggests that the Random Forest model fits the data better and has better predictive performance.

Section 8 : Prediction model with manual input

  • Try this section in RStudio

Section 9 : Conclusion

  • ANOVA: Season and weather are important factors in predicting the number of bikes rented, while holiday and weekday do not have a significant effect on the number of bikes rented.

  • Linear Regression: Temperature, dew, humidity, windspeed, and uvindex all have a significant impact on the number of bikes used, as do several of the weather and season categories.

9.1 Splitting with Sampling of 80% Training and 20% Testing data sets

  • Decision Tree: Temperature is the most important variable with an importance score of 100. Dew and UV index are the next most important variables, followed by season (Summer, Spring, Winter), humidity, weather (OvercastRain, Cloudy), and windspeed, in decreasing order of importance.

  • Random Forest: The model explains 50.86% of the variance in the response variable, with temperature as the most important variable, followed by season, humidity, dew, uvindex, and weather.

9.2 Splitting the Data by removing COVID years

  • Decision Tree: The most important feature is temperature with a value of 100, followed by dew with a value of 65.81, uvindex with a value of 38.52, and so on. The least important feature in this model appears to be windspeed with a value of 7.44.

  • Random Forest: The model explains 96.33% of the variance in the response variable, with temperature as the most important variable, followed by season, humidity, dew, uvindex, and weather.

9.3 Comparison Table for the two types of data splits

R2 = R-squared, RMSE = Root Mean Square Error, MAE = Mean Absolute Error

Model Type             Sample Split                       Non-COVID Years Split
                       R2          RMSE       MAE         R2          RMSE       MAE
Tuned Decision Tree    0.3665989   3044.444   2440.491    0.5849981   2442.709   1957.982
Bagged Decision Tree   0.4031066   2957.301   2388.401    0.6583164   2206.154   1762.025
Random Forest          0.5086464   2682.561   2117.371    0.9633303   759.8413   581.2308

  • The Random Forest model performs the best out of the three models for both the sample split and the non-COVID years split. It has the highest R-squared value and the lowest RMSE and MAE values. The Bagged Decision Tree model is second in both splits, with R-squared values over 0.4 and lower RMSE and MAE values than the Tuned Decision Tree. The Tuned Decision Tree model has the lowest R-squared values and the highest RMSE and MAE values of the three models.

  • The COVID-19 pandemic was an unexpected circumstance that impacted bike trip demand in ways that may not have been accounted for in the models developed in this project. As a result, the models’ predictions for bike trip demand during the pandemic may be less accurate than their predictions for non-COVID years.

  • This highlights the importance of considering potential unforeseen circumstances and their potential impact on the accuracy of predictive models. It also underscores the importance of regularly updating and retraining models as new data becomes available, in order to account for changes in the underlying data and any new unforeseen circumstances that may arise.

Section 10 : References

GitHub : https://github.com/mohiddin7/Final_project_DATS6101_SIM